Introduction

America is a highly diverse country, not only in terms of ethnicity but also in terms of income, industry, and law. This opens the door to a variety of possible interactions between these variables. What factors drive the way that income is distributed in the United States? What factors reliably predict whether the average income per capita in a specific area is high or low? How do state-level variations in law and freedom impact income?

How are these questions SMART?

These questions are important because they capture many facets of the story of consumption in the United States. Income serves as a measure of both productivity and lifetime consumption (although this analysis does not disentangle the two). Although their scope is broad, the questions remain specific to the concepts of income, demographics, and freedom, and they share a consistent structure: how do demographics and freedom drive income in the United States at the census tract level?

These questions also correspond to a set of highly measurable (and, luckily, premeasured) variables. Income can be imputed from tax records, while ethnicity and work status are available from census forms. Answering the questions is made simpler by the cleanliness and availability of this data: since few data points are missing across the census tracts of the 50 states of interest, it is straightforward to form statistical tests.

Finally, these questions are relevant to policy makers who want to improve the incomes of their constituents, as well as to researchers interested in establishing a baseline for the average income a community would be expected to earn based on its demographics. These are critical questions, because the ability of communities to support themselves economically has a massive impact on the wellbeing of their members.

Content

First, the structure of the U.S. Census Bureau dataset and the variables it includes are examined. Second, the groups of independent variables are presented, along with how each could affect the income per capita of a community. Then, an exploratory data analysis and some statistical tests are carried out to evaluate the significance of our variables. Finally, a conclusion looks into the further challenges and questions that future analyses would need to address.

Dataset

U.S. Census Bureau Dataset

The U.S. Census Bureau conducts the yearly American Community Survey, a project which asks Americans around the country about several dimensions of their lives, including work, income, demographics, and other activities (U.S. Census Bureau, 2019). The 2015 dataset was available via Kaggle (MuonNeutrino, 2015) and included more than 74,000 observations with 37 columns (variables). The dataset includes two variables related to income: the median household income and the income per capita. Income per capita was preferred because it adjusts per person rather than per household, and the number of people living in a given household is unknown. The variable (IncomePerCap) is calculated as the average income per capita of the population of a specific census tract. But what is a census tract, and why use it?

Census tracts

Household income in America varies significantly by geographical location. The richest counties in the country are concentrated in urban areas near big metropolises where most businesses are located. The Bay Area in northern California, Northeast Virginia, and New York are some examples. However, counties are an insufficient unit for comparing variables across locations. There are 3,142 counties in a country of 300 million inhabitants (U.S. Census Bureau, 2019), but they are far from uniform. Texas, for example, has 254 counties (U.S. Census Bureau, 2017). California, a state with approximately 10 million more people than Texas, has only 58 counties (U.S. Census Bureau, 2017). Population-wise, California has the largest county in the country, with more than 10 million inhabitants (Los Angeles), whereas Texas has more than 80 counties with fewer than 10,000 people (U.S. Census Bureau, 2017). Density-wise, New York has 4 of the 5 densest counties in the country, some of them 60,000 times denser than counties in Hawaii, Alaska, or Nevada (U.S. Census Bureau, 2013).

As a response to these inconsistencies among counties, the U.S. Census Bureau delineated “Census Tracts” at the beginning of the twentieth century. A census tract is a “geographic region defined for the purpose of taking a census.” Over the years, the U.S. Census Bureau has established census tracts in every county in America. There are over 74,000 census tracts in the country, and a typical one has around 4,000 residents. This consistency is a strength: census tracts are by and large similar in population size, and their population size does not vary much from state to state.

Description of Variables

The complete dataset includes 17 independent variables and 1 dependent variable. Based on their nature, the independent variables were classified into three groups; the two described in this section are Work Variation and Ethnic Variation.

Work Variation

Professional: Percentage (%) employed in management, business, science, and arts in a census tract.

Service: Percentage (%) employed in service jobs in a census tract.

Office: Percentage (%) employed in sales and office jobs in a census tract.

Construction: Percentage (%) employed in natural resources, construction, and maintenance in a census tract.

Production: Percentage (%) employed in production, transportation, and material movement in a census tract.

Unemployed: Unemployment rate (%) in a census tract.

Self-employed: Percentage (%) self-employed in a census tract.

Ethnic Variation

Native: Percentage (%) of population that is Native American or Native Alaskan in a census tract.

White: Percentage (%) of population that is white in a census tract.

Black: Percentage (%) of population that is black in a census tract.

Hispanic: Percentage (%) of population that is Hispanic/Latino in a census tract.

Asian: Percentage (%) of population that is Asian in a census tract.

EDA

Population Histogram and QQ

[1] 0

Outliers identified: 3589 
Proportion (%) of outliers: 5.2 
Mean of the outliers: 74140.26 
Mean without removing outliers: 28491.23 
Mean if we remove outliers: 26139.73 
Outliers successfully removed 

[1] "4369.254"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      5    2929    4086    4369    5460   53812 
[1] 2095.011

A baseline analysis of population and income was conducted. The histogram for population was skewed to the right. Census tracts had broadly similar population counts, with a mean of about 4,400 and a median of about 4,100. Counties, by contrast, are not evenly sized: some have a population of 1 million and others 10 million. Because tract populations are comparable, census tracts were easier to investigate than counties. The Q-Q plot confirmed the non-normality, as the values in the upper quartile fell far from the reference line.

Income Histogram and QQ

[1] 3589
[1] 0
[1] 69672
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    128   18776   24730   26140   32247   56040 
[1] 10274.98
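
The outlier summary printed earlier reports how many income values were removed, but not the rule used to flag them. As a minimal sketch, assuming the common boxplot rule (values more than 1.5 × IQR beyond the quartiles), the step could look like the following. The report's own output comes from R; Python's standard library is used here only for illustration, and the function name `split_outliers` and the sample values (including the 150000 extreme) are hypothetical.

```python
import statistics

def split_outliers(values, k=1.5):
    """Split values into (kept, outliers) using the k * IQR boxplot rule.

    Assumption: the report does not state its outlier rule; 1.5 * IQR is
    only a common default, not necessarily what the authors used.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers

# Illustrative incomes: eight tract-like values plus one made-up extreme.
incomes = [18021, 20689, 24125, 24710, 25713, 27526, 30480, 32813, 150000]
kept, outliers = split_outliers(incomes)

print("Outliers identified:", len(outliers))
print("Mean without removing outliers:", round(statistics.mean(incomes), 2))
print("Mean if we remove outliers:", round(statistics.mean(kept), 2))
```

Removing a handful of extreme values pulls the mean down noticeably, which is the same direction of change the report's summary shows.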

Individual EDA of Work Variations

 Factor w/ 4 levels "[0,5.3]","(5.3,7.9]",..: 2 4 2 3 1 3 3 3 3 2 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   5.300   7.900   9.251  11.600 100.000     101 

 Factor w/ 4 levels "[0,23.7]","(23.7,31.7]",..: 3 1 2 2 4 2 1 4 2 2 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00   23.70   31.70   33.23   41.80  100.00     105 

 Factor w/ 4 levels "[0,20.3]","(20.3,23.9]",..: 2 2 2 3 1 4 3 4 3 1 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00   20.30   23.90   24.12   27.70  100.00     105 

 Factor w/ 4 levels "[0,14.1]","(14.1,18.3]",..: 2 4 4 3 2 2 4 1 2 1 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00   14.10   18.30   19.65   24.00  100.00     105 

 Factor w/ 4 levels "[0,5.4]","(5.4,8.7]",..: 3 3 3 2 1 2 3 2 2 3 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   5.400   8.700   9.636  12.800 100.000     105 

 Factor w/ 4 levels "[0,7.7]","(7.7,12.3]",..: 3 4 3 3 3 3 3 1 3 4 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    7.70   12.30   13.36   17.80  100.00     105 

 Factor w/ 4 levels "[0,3.5]","(3.5,5.4]",..: 2 3 4 1 2 3 1 3 2 3 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   3.500   5.400   6.109   7.900 100.000     105 

Next, the seven work variables (professional, production, unemployment, office, service, construction, and self-employed) were examined. Each variable was binned into quartiles, and income per capita was compared across the bins with boxplots. Income decreased across quartiles for unemployment, service, construction, and production. That is to say, as the share of unemployed individuals in a given census tract rose, the income per capita fell. The only work variable for which average income increased was professional work. Office and self-employed remained relatively stable across quartiles. The histograms suggested that only the proportion of professionals was approximately normally distributed; the remaining six work variables were all skewed to the right. The Q-Q plot for professionals supported this, as the points did not stray far from the reference line and both tails were very small. The same cannot be said for the other variables, each of which had an oversized right tail and a relatively small left tail.
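
The quartile comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the report's own code (which was written in R): the helper names `quartile_bin` and `mean_by_bin` are hypothetical, and the ten sample rows are the first Unemployment and IncomePerCap values shown in the data listing, far too few to draw any conclusion from.

```python
import statistics
from bisect import bisect_left

def quartile_bin(values):
    """Assign each value to quartile bin 1..4 using right-closed intervals."""
    cuts = statistics.quantiles(values, n=4, method="inclusive")
    # bisect_left places a value equal to a cut point in the lower bin,
    # which mirrors intervals of the form (q1, q2].
    return [bisect_left(cuts, v) + 1 for v in values]

def mean_by_bin(bins, outcome):
    """Average the outcome within each bin."""
    groups = {}
    for b, y in zip(bins, outcome):
        groups.setdefault(b, []).append(y)
    return {b: statistics.mean(ys) for b, ys in sorted(groups.items())}

# First ten rows of the cleaned data, used purely as a toy sample.
unemployment = [5.4, 13.3, 6.2, 10.8, 4.2, 10.9, 11.4, 8.2, 8.7, 7.2]
income = [25713, 18021, 20689, 24125, 27526, 30480, 20442, 32813, 24028, 24710]

bins = quartile_bin(unemployment)
income_by_quartile = mean_by_bin(bins, income)
for b, m in income_by_quartile.items():
    print(f"Unemployment quartile {b}: mean income per capita {m:.0f}")
```

On the full dataset, a steady fall in the per-bin mean from quartile 1 to quartile 4 would correspond to the downward pattern the boxplots show for unemployment.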

Individual EDA of ethnicities

 Factor w/ 4 levels "[0,0.8]","(0.8,4]",..: 3 4 4 2 4 3 4 3 3 3 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.80    4.00   13.78   15.32  100.00 

 Factor w/ 4 levels "[0,2.4]","(2.4,7.2]",..: 1 1 1 3 1 3 2 1 1 1 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    2.40    7.20   17.36   21.50  100.00 

 Factor w/ 4 levels "[0,0.1]","(0.1,1.2]",..: 2 3 3 1 3 1 1 1 1 2 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.100   1.200   4.347   4.400  91.300 

 Factor w/ 4 levels "[0,37.1]","(37.1,70.3]",..: 3 2 3 3 2 3 3 3 4 3 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   37.10   70.30   61.24   88.40  100.00 

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  0.0000   0.0000   0.0000   0.7567   0.4000 100.0000 

Finally, the five ethnic variables (Native, White, Black, Hispanic, and Asian) were investigated. For White, the boxplots showed average income increasing across the first, second, and third quartiles, with no change in the fourth. For Asian, income increased from the first through the fourth quartile. For Hispanic, income increased slightly between the first and second quartiles, did not change in the third, and decreased significantly in the fourth. For Black, average income increased between the first and second quartiles and then decreased from the second to the fourth. Overall, average income did appear to change with the concentration of ethnicities in a census tract. The histogram for White was bimodal, with its highest frequency at over 8,000. The histograms for the other four ethnicities were skewed to the right. Based on the histograms, White had the largest shares, followed by Hispanic, Black, Asian, and Native. The points on the Q-Q plots for each of the ethnicity variables followed a curve, with large left and right tails. Native shares were too sparse to construct a meaningful boxplot: in the last summary above, the first quartile and the median are both 0, so quartile bins cannot be separated. The Native Q-Q plot also showed a clear systematic pattern around the line, implying non-normality. Therefore, based on the boxplots, histograms, and Q-Q plots, none of the ethnicity variables appears normally distributed.

'data.frame':   69672 obs. of  11 variables:
 $ Hispanic    : num  0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
 $ White       : num  87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
 $ Black       : num  7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
 $ Asian       : num  0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
 $ Professional: num  34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
 $ Service     : num  17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
 $ Office      : num  21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
 $ Construction: num  11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
 $ Production  : num  15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
 $ Unemployment: num  5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
 $ IncomePerCap: int  25713 18021 20689 24125 27526 30480 20442 32813 24028 24710 ...
[1] 626
[1] 0
[1] 11
'data.frame':   69567 obs. of  11 variables:
 $ Hispanic    : num  0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
 $ White       : num  87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
 $ Black       : num  7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
 $ Asian       : num  0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
 $ Professional: num  34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
 $ Service     : num  17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
 $ Office      : num  21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
 $ Construction: num  11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
 $ Production  : num  15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
 $ Unemployment: num  5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
 $ IncomePerCap: int  25713 18021 20689 24125 27526 30480 20442 32813 24028 24710 ...
 - attr(*, "na.action")= 'omit' Named int  1484 1807 2299 2499 2789 4259 4444 4448 4449 4477 ...
  ..- attr(*, "names")= chr  "1514" "1851" "2370" "2574" ...

PCA

             Hispanic  White  Black  Asian Professional Service Office
Hispanic        1.000 -0.657 -0.128  0.044       -0.331   0.272 -0.006
White          -0.657  1.000 -0.579 -0.260        0.346  -0.464 -0.019
Black          -0.128 -0.579  1.000 -0.100       -0.229   0.364  0.043
Asian           0.044 -0.260 -0.100  1.000        0.239  -0.043 -0.009
Professional   -0.331  0.346 -0.229  0.239        1.000  -0.609 -0.131
Service         0.272 -0.464  0.364 -0.043       -0.609   1.000 -0.145
Office         -0.006 -0.019  0.043 -0.009       -0.131  -0.145  1.000
             Construction Production Unemployment
Hispanic            0.254      0.109        0.212
White              -0.015     -0.100       -0.484
Black              -0.161      0.116        0.471
Asian              -0.221     -0.201       -0.091
Professional       -0.496     -0.651       -0.434
Service             0.028      0.117        0.448
Office             -0.238     -0.201        0.032
 [ reached getOption("max.print") -- omitted 3 rows ]
                 Hispanic       White       Black       Asian Professional
Hispanic      545.4011283 -476.691379  -66.397180   8.9465601   -104.28397
White        -476.6913785  965.201153 -398.350924 -69.8086312    144.90996
Black         -66.3971800 -398.350924  490.216058 -19.2051491    -68.18805
Asian           8.9465601  -69.808631  -19.205149  74.8157824     27.79961
Professional -104.2839673  144.909962  -68.188051  27.7996077    181.46453
Service        50.8005134 -115.378194   64.459823  -2.9669891    -65.72219
Office         -0.7871622   -3.437411    5.610685  -0.4386444    -10.31925
                 Service      Office Construction Production Unemployment
Hispanic       50.800513  -0.7871622    35.187809  19.085671    29.376338
White        -115.378194  -3.4374111    -2.840299 -23.253782   -89.160921
Black          64.459823   5.6106854   -21.167123  19.280331    61.880556
Asian          -2.966989  -0.4386444   -11.369500 -13.022798    -4.666184
Professional  -65.722186 -10.3192466   -39.714390 -65.704931   -34.680751
Service        64.143427  -6.7633651     1.343278   6.999397    21.278983
Office         -6.763365  34.1060908    -8.242961  -8.779620     1.095836
 [ reached getOption("max.print") -- omitted 3 rows ]
Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6     PC7
Standard deviation     1.7878 1.3389 1.1653 1.0355 0.88819 0.82267 0.76933
Proportion of Variance 0.3196 0.1792 0.1358 0.1072 0.07889 0.06768 0.05919
Cumulative Proportion  0.3196 0.4989 0.6347 0.7419 0.82078 0.88845 0.94764
                           PC8     PC9     PC10
Standard deviation     0.71304 0.12303 0.003342
Proportion of Variance 0.05084 0.00151 0.000000
Cumulative Proportion  0.99849 1.00000 1.000000
                     PC1         PC2         PC3         PC4          PC5
Hispanic      0.29824667 -0.01361924  0.61175592 -0.22874996 -0.202137004
White        -0.41731343  0.37316978 -0.26669236 -0.02170213  0.001281164
Black         0.29691738 -0.34563121 -0.45079386  0.18397271 -0.035677561
Asian        -0.07602835 -0.38415017  0.43937912  0.18524481  0.610498550
Professional -0.46156573 -0.28644972  0.07425599  0.23265934 -0.203398965
Service       0.40338607 -0.11379011 -0.04473355  0.16595822 -0.238779921
Office       -0.02918079 -0.21748570 -0.16640164 -0.88655225  0.152863255
                     PC6          PC7          PC8          PC9          PC10
Hispanic      0.18631155 -0.393126083  0.003871628  0.504139978 -2.330305e-05
White        -0.30458284  0.004065075  0.229528223  0.685219633 -1.885794e-05
Black         0.34118441  0.222931142 -0.394503125  0.481982753 -1.032280e-05
Asian        -0.21635417  0.383917067  0.093476318  0.208880218 -1.178804e-05
Professional  0.29065544 -0.118425431  0.128229850 -0.001487139  6.992409e-01
Service      -0.73848184 -0.074734416 -0.118114735  0.007694364  4.157180e-01
Office       -0.05646975  0.132714775 -0.045973528 -0.002938736  3.031380e-01
 [ reached getOption("max.print") -- omitted 3 rows ]

      PC1               PC2                PC3               PC4           
 Min.   :-5.4230   Min.   :-8.10196   Min.   :-4.8516   Min.   :-13.66580  
 1st Qu.:-1.2752   1st Qu.:-0.88018   1st Qu.:-0.6341   1st Qu.: -0.62140  
 Median :-0.3129   Median : 0.02552   Median :-0.1991   Median :  0.02273  
 Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   :  0.00000  
 3rd Qu.: 1.0789   3rd Qu.: 0.95765   3rd Qu.: 0.5110   3rd Qu.:  0.64724  
 Max.   :10.4288   Max.   : 9.79218   Max.   : 6.3807   Max.   :  5.32333  
      PC5               PC6                PC7               PC8          
 Min.   :-4.4289   Min.   :-8.51468   Min.   :-5.8693   Min.   :-3.99610  
 1st Qu.:-0.5584   1st Qu.:-0.46560   1st Qu.:-0.4500   1st Qu.:-0.39751  
 Median :-0.0921   Median : 0.02195   Median :-0.0353   Median :-0.02785  
 Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000  
 3rd Qu.: 0.4421   3rd Qu.: 0.49022   3rd Qu.: 0.4223   3rd Qu.: 0.36326  
 Max.   : 8.3336   Max.   : 6.46528   Max.   :10.8685   Max.   :11.38794  
      PC9                PC10           
 Min.   :-2.08722   Min.   :-1.046e-02  
 1st Qu.:-0.01512   1st Qu.:-3.385e-05  
 Median : 0.02115   Median :-8.511e-06  
 Mean   : 0.00000   Mean   : 0.000e+00  
 3rd Qu.: 0.04651   3rd Qu.: 2.566e-05  
 Max.   : 0.29456   Max.   : 1.048e-02  
               PC1           PC2           PC3           PC4           PC5
PC1   1.000000e+00 -3.229650e-15 -2.721219e-16  1.871797e-15  4.359679e-15
PC2  -3.229650e-15  1.000000e+00  3.697138e-15 -2.377401e-15 -3.444515e-15
PC3  -2.721219e-16  3.697138e-15  1.000000e+00 -3.061978e-15 -2.156517e-17
PC4   1.871797e-15 -2.377401e-15 -3.061978e-15  1.000000e+00  3.035937e-15
PC5   4.359679e-15 -3.444515e-15 -2.156517e-17  3.035937e-15  1.000000e+00
PC6  -1.446371e-15  4.523205e-15  2.506401e-15 -4.865942e-15 -6.599124e-15
PC7   1.693192e-15 -5.820379e-15 -2.483534e-15  8.687901e-17 -2.721381e-15
               PC6           PC7           PC8           PC9          PC10
PC1  -1.446371e-15  1.693192e-15 -3.551669e-15  4.151794e-15  7.572904e-13
PC2   4.523205e-15 -5.820379e-15 -2.470873e-15 -6.254922e-14  1.718849e-13
PC3   2.506401e-15 -2.483534e-15  2.606976e-15 -1.085432e-13  4.459767e-13
PC4  -4.865942e-15  8.687901e-17  2.406897e-15  5.094873e-14 -1.366731e-12
PC5  -6.599124e-15 -2.721381e-15  5.057319e-15  3.157097e-14  1.250314e-12
PC6   1.000000e+00 -2.558483e-16 -9.819399e-16  9.526392e-15 -1.459345e-12
PC7  -2.558483e-16  1.000000e+00  2.078380e-15  6.076442e-14  3.275222e-13
 [ reached getOption("max.print") -- omitted 3 rows ]
               PC1           PC2           PC3           PC4           PC5
PC1   3.196104e+00 -7.730326e-15 -5.669127e-16  3.465187e-15  6.922661e-15
PC2  -7.730326e-15  1.792519e+00  5.768193e-15 -3.296035e-15 -4.096077e-15
PC3  -5.669127e-16  5.768193e-15  1.357952e+00 -3.694893e-15 -2.232046e-17
PC4   3.465187e-15 -3.296035e-15 -3.694893e-15  1.072297e+00  2.792276e-15
PC5   6.922661e-15 -4.096077e-15 -2.232046e-17  2.792276e-15  7.888895e-01
PC6  -2.127232e-15  4.981990e-15  2.402799e-15 -4.145235e-15 -4.821910e-15
PC7   2.328799e-15 -5.995129e-15 -2.226526e-15  6.921300e-17 -1.859571e-15
               PC6           PC7           PC8           PC9          PC10
PC1  -2.127232e-15  2.328799e-15 -4.527510e-15  9.131644e-16  4.524235e-15
PC2   4.981990e-15 -5.995129e-15 -2.358842e-15 -1.030283e-14  7.690273e-16
PC3   2.402799e-15 -2.226526e-15  2.166185e-15 -1.556136e-14  1.736707e-15
PC4  -4.145235e-15  6.921300e-17  1.777180e-15  6.490730e-15 -4.729472e-15
PC5  -4.821910e-15 -1.859571e-15  3.202911e-15  3.449838e-15  3.711072e-15
PC6   6.767830e-01 -1.619282e-16 -5.760047e-16  9.641750e-16 -4.011944e-15
PC7  -1.619282e-16  5.918759e-01  1.140136e-15  5.751318e-15  8.420314e-16
 [ reached getOption("max.print") -- omitted 3 rows ]


Call:
lm(formula = IncomePerCap ~ ., data = pcadata_pcr_rot)

Residuals:
   Min     1Q Median     3Q    Max 
-57889  -3154   -136   3093  39355 

Coefficients:
            Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 26167.82      20.93 1250.463  < 2e-16 ***
PC1         -4585.05      11.71 -391.701  < 2e-16 ***
PC2         -1454.29      15.63  -93.043  < 2e-16 ***
PC3           604.54      17.96   33.664  < 2e-16 ***
PC4           994.55      20.21   49.214  < 2e-16 ***
PC5          -878.20      23.56  -37.274  < 2e-16 ***
PC6          1377.18      25.44   54.140  < 2e-16 ***
PC7          -205.74      27.20   -7.564 3.96e-14 ***
PC8          -196.99      29.35   -6.712 1.93e-11 ***
PC9          3301.06     170.10   19.407  < 2e-16 ***
PC10        -3519.28    6262.21   -0.562    0.574    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5519 on 69556 degrees of freedom
Multiple R-squared:  0.7102,    Adjusted R-squared:  0.7101 
F-statistic: 1.704e+04 on 10 and 69556 DF,  p-value: < 2.2e-16
Data:   X dimension: 69567 10 
    Y dimension: 69567 1
Fit method: svdpc
Number of components considered: 10

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV           10252     6157     5841     5799     5707     5653     5539
adjCV        10252     6157     5841     5799     5707     5653     5538
       7 comps  8 comps  9 comps  10 comps
CV        5536     5535     5520      5520
adjCV     5536     5535     5520      5520

TRAINING: % variance explained
              1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
X               31.96    49.89    63.47    74.19    82.08    88.85    94.76
IncomePerCap    63.93    67.54    68.01    69.02    69.60    70.82    70.84
              8 comps  9 comps  10 comps
X               99.85   100.00    100.00
IncomePerCap    70.86    71.02     71.02

K-Means

List of 9
 $ cluster     : Named int [1:69567] 2 2 2 2 2 1 2 1 2 2 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:2, 1:11] -0.329 0.174 0.42 -0.222 -0.32 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "1" "2"
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:2] 1.15e+12 1.34e+12
 $ tot.withinss: num 2.5e+12
 $ betweenss   : num 4.81e+12
 $ size        : int [1:2] 24086 45481
 $ iter        : int 1
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
K-means clustering with 2 clusters of sizes 24086, 45481

Cluster means:
    Hispanic      White      Black      Asian Professional    Service
1 -0.3285506  0.4200874 -0.3199683  0.2372900    0.9282858 -0.6150829
2  0.1739950 -0.2224715  0.1694500 -0.1256649   -0.4916051  0.3257379
       Office Construction Production Unemployment IncomePerCap
1 -0.01890396   -0.4302842 -0.6556146   -0.5268802     37598.33
2  0.01001123    0.2278715  0.3472029    0.2790272     20114.40

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 2  2  2  2  2  1  2  1  2  2  2  2  2  2  2  1  2  2  1  1  2  2  1  2  2  2 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 2  2  1  1  1  1  1  2  1  1  2  1  1  2  2  2  2  2  2  2  2  2  2  2  2  2 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
[1] 1.153649e+12 1.344211e+12
 (between_SS / total_SS =  65.8 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

List of 9
 $ cluster     : Named int [1:69567] 2 1 1 2 2 2 1 2 2 2 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:3, 1:11] 0.432 -0.24 -0.363 -0.575 0.34 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "1" "2" "3"
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:3] 4.33e+11 3.94e+11 3.93e+11
 $ tot.withinss: num 1.22e+12
 $ betweenss   : num 6.09e+12
 $ size        : int [1:3] 27175 29638 12754
 $ iter        : int 2
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
K-means clustering with 3 clusters of sizes 27175, 29638, 12754

Cluster means:
    Hispanic      White      Black       Asian Professional    Service
1  0.4318377 -0.5751979  0.3952287 -0.15378102   -0.7689991  0.6127621
2 -0.2397814  0.3399978 -0.2076592 -0.02410908    0.1329131 -0.2111302
3 -0.3629095  0.4354829 -0.3595529  0.38368702    1.3296436 -0.8149861
       Office Construction  Production Unemployment IncomePerCap
1 -0.02605785    0.2974405  0.51204618    0.6194822     16576.79
2  0.07327428    0.0108065 -0.07894429   -0.3101802     27821.96
3 -0.11475466   -0.6588701 -0.90756657   -0.5991303     42759.54

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 2  1  1  2  2  2  1  2  2  2  2  1  1  1  2  2  2  1  3  3  2  2  2  1  2  2 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 1  1  2  2  3  2  3  1  3  3  2  2  2  2  1  1  2  1  1  1  1  1  1  1  1  1 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 1  1  2  1  1  1  1  1  1  1  1  2  1  1  1  1  1  2  1  1  1  1  1 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
[1] 433025278355 393691127581 392888818743
 (between_SS / total_SS =  83.3 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

List of 9
 $ cluster     : Named int [1:69567] 4 2 4 4 4 1 4 1 4 4 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:4, 1:11] -0.302 0.668 -0.374 -0.132 0.407 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "1" "2" "3" "4"
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:4] 1.76e+11 1.92e+11 1.73e+11 1.75e+11
 $ tot.withinss: num 7.15e+11
 $ betweenss   : num 6.6e+12
 $ size        : int [1:4] 17574 17698 8266 26029
 $ iter        : int 2
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
K-means clustering with 4 clusters of sizes 17574, 17698, 8266, 26029

Cluster means:
    Hispanic      White       Black      Asian Professional     Service
1 -0.3016974  0.4066104 -0.28173021  0.1074823    0.5711741 -0.43512866
2  0.6680011 -0.8809881  0.57590840 -0.1651807   -0.9293827  0.83254425
3 -0.3738848  0.4389987 -0.37976189  0.4559152    1.5331982 -0.92442949
4 -0.1317654  0.1850702 -0.08076331 -0.1050413   -0.2406168  0.02128076
       Office Construction Production Unemployment IncomePerCap
1  0.06629532   -0.2290380 -0.4319137  -0.46098899     32840.09
2 -0.05484069    0.3137655  0.5749681   0.89883620     14433.73
3 -0.17952039   -0.7693828 -1.0184110  -0.62970642     45780.77
4  0.04953752    0.1856318  0.2240905  -0.09992813     23412.84

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 4  2  4  4  4  1  4  1  4  4  4  4  4  2  4  1  4  2  1  1  4  4  1  2  4  4 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 2  4  1  1  3  1  1  4  1  3  4  1  1  4  4  4  4  4  2  2  2  2  2  4  4  2 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 2  4  4  2  4  4  4  2  2  2  4  4  4  2  2  4  4  4  2  4  2  4  4 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
[1] 175536566241 191571930523 172553341760 175374034150
 (between_SS / total_SS =  90.2 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

List of 9
 $ cluster     : Named int [1:69567] 5 1 1 1 5 5 1 4 1 5 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:5, 1:11] -0.00846 0.85002 -0.3769 -0.33227 -0.24884 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:5] "1" "2" "3" "4" ...
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:5] 9.11e+10 1.06e+11 9.67e+10 9.02e+10 8.75e+10
 $ tot.withinss: num 4.72e+11
 $ betweenss   : num 6.84e+12
 $ size        : int [1:5] 20760 12813 6192 11579 18223
 $ iter        : int 2
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
K-means clustering with 5 clusters of sizes 20760, 12813, 6192, 11579, 18223

Cluster means:
      Hispanic       White       Black       Asian Professional    Service
1 -0.008456987 -0.01428733  0.06737527 -0.12483068   -0.4552929  0.2049365
2  0.850017949 -1.08804866  0.68083922 -0.17734851   -1.0217088  0.9792310
3 -0.376900236  0.44391016 -0.39300833  0.48161313    1.6342530 -0.9824220
4 -0.332267736  0.42488521 -0.31698657  0.22115974    0.8637778 -0.5739916
5 -0.248840396  0.36049690 -0.22051300 -0.03726641    0.1329121 -0.2234518
       Office Construction  Production Unemployment IncomePerCap
1  0.02032925   0.25592838  0.38069044   0.09861782     20788.54
2 -0.07442558   0.32126962  0.59350367   1.09913874     13091.47
3 -0.21879800  -0.82200303 -1.06575585  -0.64529647     47520.79
4  0.02701923  -0.40030391 -0.64316728  -0.52575884     36236.88
5  0.08634809   0.01621363 -0.08018998  -0.33184070     27836.80

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 5  1  1  1  5  5  1  4  1  5  1  1  1  1  5  5  1  2  4  4  5  5  5  1  1  1 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 1  1  5  5  4  4  4  1  4  3  1  5  5  1  1  1  5  1  2  1  2  1  2  1  1  1 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 1  1  1  2  1  1  1  1  1  1  1  5  1  2  2  1  1  5  1  1  2  1  1 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
[1]  91107426890 106355408998  96685021931  90198216197  87531045107
 (between_SS / total_SS =  93.5 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

List of 9
 $ cluster     : Named int [1:69567] 6 4 4 6 6 2 4 2 6 6 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:6, 1:11] -0.349 -0.285 0.974 0.132 -0.385 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:6] "1" "2" "3" "4" ...
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:6] 5.33e+10 5.39e+10 6.79e+10 5.20e+10 5.61e+10 ...
 $ tot.withinss: num 3.36e+11
 $ betweenss   : num 6.98e+12
 $ size        : int [1:6] 8294 13096 9901 16290 4763 17223
 $ iter        : int 3
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
K-means clustering with 6 clusters of sizes 8294, 13096, 9901, 16290, 4763, 17223

Cluster means:
    Hispanic      White      Black       Asian Professional     Service
1 -0.3494717  0.4296992 -0.3360808  0.30654695    1.0897348 -0.68272359
2 -0.2854601  0.3976771 -0.2665621  0.05425283    0.4263884 -0.36762711
3  0.9739417 -1.2113147  0.7268560 -0.18905224   -1.0768116  1.07158297
4  0.1316581 -0.2288132  0.2195608 -0.13438783   -0.6075451  0.36540166
5 -0.3852568  0.4492850 -0.4005145  0.50197893    1.7117093 -1.02643109
6 -0.1925231  0.2792048 -0.1502203 -0.09190832   -0.1287054 -0.06945889
        Office Construction Production Unemployment IncomePerCap
1 -0.029446162   -0.5329557 -0.7839551   -0.5661097     38948.40
2  0.089116820   -0.1457015 -0.3276215   -0.4288697     31179.25
3 -0.088194644    0.3291525  0.5983303    1.2390235     12157.33
4  0.007625546    0.2812103  0.4727567    0.2799036     18933.00
5 -0.254726418   -0.8568239 -1.1023498   -0.6538796     48913.98
6  0.060350087    0.1491981  0.1403863   -0.1974674     24809.23

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 6  4  4  6  6  2  4  2  6  6  6  4  4  4  6  2  6  3  1  1  6  6  2  4  6  6 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 4  4  2  2  1  2  1  4  1  5  6  2  2  6  4  4  6  4  3  4  3  4  3  4  4  4 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 4  4  6  4  4  4  4  4  4  4  4  6  4  4  4  4  4  6  4  4  3  4  6 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
[1] 53289943534 53860735274 67861554081 51981016101 56119303347 52399097534
 (between_SS / total_SS =  95.4 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

List of 9
 $ cluster     : Named int [1:69567] 2 5 7 7 2 2 7 3 7 7 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:7, 1:11] -0.396 -0.258 -0.313 1.054 0.237 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:7] "1" "2" "3" "4" ...
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:7] 3.01e+10 3.39e+10 3.39e+10 5.06e+10 3.68e+10 ...
 $ tot.withinss: num 2.52e+11
 $ betweenss   : num 7.06e+12
 $ size        : int [1:7] 3586 13111 9445 8258 13680 6091 15396
 $ iter        : int 2
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
K-means clustering with 7 clusters of sizes 3586, 13111, 9445, 8258, 13680, 6091, 15396

Cluster means:
    Hispanic      White       Black      Asian Professional    Service
1 -0.3964506  0.4542494 -0.40764714  0.5282089    1.7925081 -1.0717282
2 -0.2576300  0.3733856 -0.23385201 -0.0269068    0.1755536 -0.2474022
3 -0.3125257  0.4200419 -0.30528640  0.1527982    0.6907586 -0.4893448
4  1.0539699 -1.2761352  0.73515083 -0.1940444   -1.1025060  1.1203953
5  0.2366030 -0.3958078  0.34172930 -0.1373108   -0.7010158  0.4859066
6 -0.3575666  0.4294729 -0.34914245  0.3685170    1.2778561 -0.7833195
       Office Construction Production Unemployment IncomePerCap
1 -0.29189615 -0.895033912 -1.1399021  -0.65954880     50244.25
2  0.08656937 -0.003686199 -0.1156669  -0.35053069     28266.51
3  0.06459196 -0.301790762 -0.5299635  -0.49757889     34274.90
4 -0.08602009  0.329565942  0.5903558   1.33577584     11570.50
5 -0.02053170  0.292153979  0.5252348   0.41200162     17787.86
6 -0.07512151 -0.640358239 -0.8939002  -0.59387071     41486.24
 [ reached getOption("max.print") -- omitted 1 row ]

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 2  5  7  7  2  2  7  3  7  7  7  7  5  5  7  2  7  5  3  3  2  2  2  5  7  7 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 5  5  2  2  6  3  3  7  3  6  7  2  3  7  5  5  2  5  4  5  5  5  4  5  7  5 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 5  5  7  5  5  7  7  5  5  5  5  2  5  5  5  7  5  2  5  7  4  5  7 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
[1] 30078367255 33918666154 33924690315 50620627295 36763050021 31979820579
[7] 34290620831
 (between_SS / total_SS =  96.6 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

List of 9
 $ cluster     : Named int [1:69567] 1 8 7 1 1 5 7 5 1 1 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:8, 1:11] -0.21 -0.398 -0.331 -0.359 -0.276 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:8] "1" "2" "3" "4" ...
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:8] 2.38e+10 2.24e+10 2.47e+10 2.28e+10 2.44e+10 ...
 $ tot.withinss: num 1.95e+11
 $ betweenss   : num 7.12e+12
 $ size        : int [1:8] 13055 3152 7742 5140 10623 6131 13192 10532
 $ iter        : int 2
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
K-means clustering with 8 clusters of sizes 13055, 3152, 7742, 5140, 10623, 6131, 13192, 10532

Cluster means:
    Hispanic       White       Black       Asian Professional     Service
1 -0.2099021  0.31065143 -0.17687237 -0.08368332  -0.08389187 -0.09881344
2 -0.3981141  0.45578358 -0.40960492  0.53311507   1.81785638 -1.08784127
3 -0.3314624  0.43062774 -0.31765082  0.20067007   0.83132143 -0.55740904
4 -0.3588452  0.42873072 -0.36110010  0.40713285   1.35698554 -0.82340123
5 -0.2764009  0.38868612 -0.25353852  0.02632578   0.34875186 -0.33062038
6  1.1604954 -1.34363273  0.72389507 -0.20529827  -1.11835632  1.17173776
       Office Construction Production Unemployment IncomePerCap
1  0.06138421    0.1289094  0.1064936  -0.23047133     25340.47
2 -0.30249732   -0.9092239 -1.1486967  -0.66205029     50788.01
3  0.03930789   -0.3810941 -0.6273035  -0.52118182     35892.73
4 -0.10289042   -0.6828886 -0.9379571  -0.61013558     42677.39
5  0.09089521   -0.1005031 -0.2648395  -0.40948506     30223.86
6 -0.08558404    0.3344531  0.5598170   1.46755772     10698.00
 [ reached getOption("max.print") -- omitted 2 rows ]

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 1  8  7  1  1  5  7  5  1  1  1  7  7  8  1  5  7  8  3  3  1  1  5  7  7  7 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 8  7  5  5  4  5  3  7  3  4  1  5  5  1  7  7  1  7  6  8  8  8  8  7  7  7 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 8  7  7  8  7  7  7  8  7  8  7  1  7  8  8  7  7  1  7  7  6  7  7 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
[1] 23806194051 22351110068 24674184543 22763698227 24425126210 32229516603
[7] 22734152115 22364091140
 (between_SS / total_SS =  97.3 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

List of 9
 $ cluster     : Named int [1:69567] 7 3 3 7 9 1 3 1 7 7 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:9, 1:11] -0.294 -0.347 0.049 -0.402 1.22 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:9] "1" "2" "3" "4" ...
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:9] 1.54e+10 1.52e+10 1.73e+10 1.58e+10 2.42e+10 ...
 $ tot.withinss: num 1.55e+11
 $ betweenss   : num 7.16e+12
 $ size        : int [1:9] 8025 5905 11753 2719 4994 4354 12230 9140 10447
 $ iter        : int 3
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
K-means clustering with 9 clusters of sizes 8025, 5905, 11753, 2719, 4994, 4354, 12230, 9140, 10447

Cluster means:
     Hispanic      White      Black       Asian Professional     Service
1 -0.29398811  0.4084707 -0.2881384  0.09801206    0.5412472 -0.41887898
2 -0.34732990  0.4326622 -0.3266366  0.26353963    0.9956851 -0.63591035
3  0.04903299 -0.1030093  0.1312846 -0.13421926   -0.5427295  0.27794329
4 -0.40241766  0.4583190 -0.4144673  0.54771279    1.8491328 -1.10706894
5  1.21957721 -1.3726251  0.7042742 -0.21976591   -1.1201758  1.19133726
6 -0.35966388  0.4293714 -0.3692219  0.42709539    1.4300252 -0.86183629
        Office Construction  Production Unemployment IncomePerCap
1  0.092171984 -0.214882278 -0.42684953   -0.4601316     32552.93
2 -0.003486057 -0.478374937 -0.72844821   -0.5525047     37694.40
3  0.017290823  0.273752980  0.44804925    0.1820710     19710.62
4 -0.310870838 -0.926207144 -1.16439186   -0.6642217     51363.70
5 -0.083922955  0.335218762  0.54026333    1.5507180     10154.85
6 -0.136970585 -0.719826670 -0.97229327   -0.6191718     43867.61
 [ reached getOption("max.print") -- omitted 3 rows ]

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 7  3  3  7  9  1  3  1  7  7  7  3  3  3  7  9  7  8  2  2  9  9  9  3  7  7 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 3  3  1  9  2  1  2  3  2  6  7  1  1  7  3  3  9  3  5  8  8  8  8  3  3  3 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 3  3  7  8  3  3  3  8  3  3  3  9  3  8  8  3  3  7  3  3  5  3  7 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
[1] 15416927509 15194151787 17298838728 15761413559 24236871148 16554689443
[7] 17160968891 17348442911 16348725566
 (between_SS / total_SS =  97.9 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

List of 9
 $ cluster     : Named int [1:69567] 9 2 8 9 5 5 8 10 9 9 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:10, 1:11] 1.293 0.172 -0.361 0.733 -0.269 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:10] "1" "2" "3" "4" ...
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:10] 1.67e+10 1.25e+10 1.39e+10 1.17e+10 1.19e+10 ...
 $ tot.withinss: num 1.28e+11
 $ betweenss   : num 7.18e+12
 $ size        : int [1:10] 3740 9868 3958 7455 8799 2488 5279 10782 10229 6969
 $ iter        : int 2
 $ ifault      : int 4
 - attr(*, "class")= chr "kmeans"
K-means clustering with 10 clusters of sizes 3740, 9868, 3958, 7455, 8799, 2488, 5279, 10782, 10229, 6969

Cluster means:
      Hispanic      White       Black        Asian Professional    Service
1   1.29347073 -1.3926502  0.65110535 -0.227144565  -1.10439711  1.2206491
2   0.17193979 -0.2995527  0.27181346 -0.131665132  -0.65475827  0.4157717
3  -0.36067127  0.4329611 -0.37594612  0.433931771   1.47068572 -0.8852975
4   0.73251794 -1.0447516  0.74170741 -0.163014767  -1.02683185  0.9384102
5  -0.26944247  0.3834500 -0.24218980 -0.002215569   0.27358175 -0.2989392
6  -0.40143948  0.4574908 -0.41624907  0.552865396   1.86667764 -1.1156001
         Office Construction  Production Unemployment IncomePerCap
1  -0.076264666   0.32954885  0.47907560   1.65962676      9446.23
2  -0.005854058   0.28803780  0.50879858   0.34460804     18266.13
3  -0.150092261  -0.74228280 -0.99238368  -0.63110451     44529.44
4  -0.088022816   0.32539421  0.65367907   0.92950639     14162.91
5   0.094618993  -0.05725246 -0.20068842  -0.38834330     29357.81
6  -0.323599903  -0.93488726 -1.16996086  -0.66198939     51688.15
 [ reached getOption("max.print") -- omitted 4 rows ]

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 9  2  8  9  5  5  8 10  9  9  8  8  2  2  9  5  8  4 10  7  9  9  5  2  8  8 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 2  2  5  5  7 10 10  8 10  3  8  5 10  8  2  2  9  2  4  2  4  2  4  2  8  2 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 2  2  8  4  2  8  8  2  2  2  8  9  2  4  2  8  2  9  2  8  4  2  8 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
 [1] 16668900913 12479477177 13910959117 11660948482 11895068060 12674139695
 [7] 12952720882 11959674961 11678714620 12133241415
 (between_SS / total_SS =  98.2 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
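
The between_SS / total_SS percentages printed above can be recomputed from the totss and tot.withinss values, which also makes the diminishing return from each extra cluster explicit. The following is a minimal sketch that uses only the rounded figures shown in the str() printouts for k = 5 to 10; the variable names are illustrative and are not taken from the original analysis code.

```r
# Values copied from the str() printouts above (three significant digits).
totss <- 7.31e12
tot_withinss <- c("5" = 4.72e11, "6" = 3.36e11, "7" = 2.52e11,
                  "8" = 1.95e11, "9" = 1.55e11, "10" = 1.28e11)

# Percentage of the total sum of squares explained by each clustering.
explained_pct <- round(100 * (1 - tot_withinss / totss), 1)

# Reduction in within-cluster sum of squares from adding one more cluster.
marginal_gain <- -diff(tot_withinss)

print(explained_pct)
print(marginal_gain)
```

Up to rounding of the printed inputs, the percentages should agree with those reported by kmeans, and the shrinking marginal gains are what an elbow plot would show. Because IncomePerCap appears in dollars in the cluster means while the other columns look standardized, income probably dominates the distances; standardizing it with scale() before clustering would be one way to check how much the ratio depends on that.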

KNN

Preprocessing KNN

 Factor w/ 2 levels "[855,2.47e+04]",..: 2 1 1 1 2 2 1 2 1 1 ...
[1] "factor"
[1] "[855,2.47e+04]"     "(2.47e+04,5.6e+04]"
 Factor w/ 4 levels "[855,1.88e+04]",..: 3 1 2 2 3 3 2 4 2 2 ...
[1] "factor"
[1] "[855,1.88e+04]"      "(1.88e+04,2.47e+04]" "(2.47e+04,3.23e+04]"
[4] "(3.23e+04,5.6e+04]" 
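
The factor levels above look like the interval labels that base R's cut() produces when a numeric variable is split at its quantiles. A minimal sketch of that kind of binning follows, assuming simulated incomes: the vector income and its lognormal parameters are hypothetical, and the report's actual preprocessing code is not shown.

```r
set.seed(3)
# Hypothetical per-capita incomes standing in for IncomePerCap.
income <- round(rlnorm(1000, meanlog = 10.1, sdlog = 0.4))

# Two bins split at the median. include.lowest = TRUE keeps the minimum
# value inside the first interval, which is why it is printed as "[a,b]".
ipc2 <- cut(income, breaks = quantile(income, probs = c(0, 0.5, 1)),
            include.lowest = TRUE)

# Four bins split at the quartiles.
ipc4 <- cut(income, breaks = quantile(income, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)

print(table(ipc2))
print(table(ipc4))
```

Quantile breaks give classes of nearly equal size, which matches the roughly 50% and 25% prevalences reported in the KNN results.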
'data.frame':   69567 obs. of  12 variables:
 $ Hispanic    : num  0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
 $ White       : num  87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
 $ Black       : num  7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
 $ Asian       : num  0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
 $ Professional: num  34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
 $ Service     : num  17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
 $ Office      : num  21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
 $ Construction: num  11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
 $ Production  : num  15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
 $ Unemployment: num  5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
 $ ipc2        : Factor w/ 2 levels "[855,2.47e+04]",..: 2 1 1 1 2 2 1 2 1 1 ...
 $ ipc4        : Factor w/ 4 levels "[855,1.88e+04]",..: 3 1 2 2 3 3 2 4 2 2 ...
[1] 0
[1] 0
[1] 12

KNN Model

Train-Test split (about 2:1; 22,836 of 69,567 tracts held out for testing)

KNN 2 categories

Selecting the correct “k”

How does the choice of “k” affect classification accuracy? Let’s create a function that calculates classification accuracy for a range of “k” values.
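
A function of this kind could look like the sketch below. It is written under stated assumptions: the data are simulated, the names knn_accuracy, train_x, and test_x are hypothetical, and the classifier is knn() from the class package, which may or may not be what the original analysis used.

```r
library(class)  # provides knn(); ships with standard R installations

set.seed(1)
# Hypothetical stand-in data: two scaled predictors and a two-level label.
n <- 600
x <- matrix(rnorm(2 * n), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.5) > 0, "High", "Low"))

# Hold out about one third of the rows for testing.
test_idx <- sample(n, size = round(n / 3))
train_x <- x[-test_idx, ]
test_x  <- x[test_idx, ]
train_y <- y[-test_idx]
test_y  <- y[test_idx]

# Accuracy of a k-nearest-neighbour classifier for a single value of k.
knn_accuracy <- function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
}

# Odd k from 1 to 29 gives 15 candidates, stored as a 2 x 15 matrix.
k_values <- seq(1, 29, by = 2)
results <- rbind(k = k_values, accuracy = sapply(k_values, knn_accuracy))
str(results)
```

Odd values of k avoid tied votes in the two-class case. Picking the k with the highest test accuracy is the simplest rule, although cross-validation on the training set would give a less optimistic estimate.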

 num [1:2, 1:15] 1 0.796 3 0.823 5 ...

Results

 Factor w/ 2 levels "[855,2.47e+04]",..: 1 1 1 1 1 2 2 1 1 1 ...
 - attr(*, "nn.index")= int [1:22836, 1:9] 31430 8744 21004 2152 14716 18952 43436 37471 18814 14542 ...
 - attr(*, "nn.dist")= num [1:22836, 1:9] 0.569 0.47 0.541 0.497 0.401 ...
[1] 22836
dat_pred_ipc2
[855,2.47e+04]           High 
         11177          11659 
                dat_ipc2.testLabels
dat_pred_ipc2    [855,2.47e+04] High
  [855,2.47e+04]           9492 1685
  High                     1940 9719
[1] 22836
[1] 9492 9719
[1] 0.8412594
      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
  8.412594e-01   6.825270e-01   8.364545e-01   8.459773e-01   5.006131e-01 
AccuracyPValue  McnemarPValue 
  0.000000e+00   2.457036e-05 
         Sensitivity          Specificity       Pos Pred Value 
           0.8303009            0.8522448            0.8492440 
      Neg Pred Value            Precision               Recall 
           0.8336049            0.8492440            0.8303009 
                  F1           Prevalence       Detection Rate 
           0.8396656            0.5006131            0.4156595 
Detection Prevalence    Balanced Accuracy 
           0.4894465            0.8412729 
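
The headline statistics above follow directly from the four cells of the confusion matrix. The sketch below recomputes them by hand; the short labels "Low" and "High" are introduced here for readability ("Low" stands for the [855,2.47e+04] class, which the output treats as the positive class).

```r
# Confusion matrix copied from the output above:
# rows are predictions, columns are the true test labels.
cm <- matrix(c(9492, 1940, 1685, 9719), nrow = 2,
             dimnames = list(predicted = c("Low", "High"),
                             actual    = c("Low", "High")))

accuracy    <- sum(diag(cm)) / sum(cm)                  # correct / all
sensitivity <- cm["Low", "Low"] / sum(cm[, "Low"])      # recall of "Low"
specificity <- cm["High", "High"] / sum(cm[, "High"])   # recall of "High"
precision   <- cm["Low", "Low"] / sum(cm["Low", ])      # positive predictive value

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 4))
```

Since both classes hold about half of the test tracts, the accuracy of roughly 0.84 compares against a no-information rate of about 0.50.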

KNN 4 categories

Selecting the correct “k”

How does the choice of “k” affect classification accuracy? Let’s create a function that calculates classification accuracy for a range of “k” values.

 num [1:2, 1:15] 1 0.563 3 0.597 5 ...

Results

 Factor w/ 4 levels "[855,1.88e+04]",..: 1 2 2 1 2 3 3 1 1 2 ...
[1] 22836
dat_pred_ipc4
     [855,1.88e+04]             Mid-Low (2.47e+04,3.23e+04]  (3.23e+04,5.6e+04] 
               5284                5917                5721                5914 
                     dat_ipc4.testLabels
dat_pred_ipc4         [855,1.88e+04] Mid-Low (2.47e+04,3.23e+04]
  [855,1.88e+04]                4128     981                 159
  Mid-Low                       1293    3015                1440
  (2.47e+04,3.23e+04]            237    1483                2919
  (3.23e+04,5.6e+04]              65     230                1215
                     dat_ipc4.testLabels
dat_pred_ipc4         (3.23e+04,5.6e+04]
  [855,1.88e+04]                      16
  Mid-Low                            169
  (2.47e+04,3.23e+04]               1082
  (3.23e+04,5.6e+04]                4404
[1] 22836
[1] 4128 3015 2919 4404
[1] 0.6334735
      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
  6.334735e-01   5.113148e-01   6.271853e-01   6.397277e-01   2.510510e-01 
AccuracyPValue  McnemarPValue 
  0.000000e+00   1.805656e-20 
                           Sensitivity Specificity Pos Pred Value
Class: [855,1.88e+04]        0.7213000   0.9324490      0.7812263
Class: Mid-Low               0.5281135   0.8305599      0.5095488
Class: (2.47e+04,3.23e+04]   0.5091575   0.8361691      0.5102255
Class: (3.23e+04,5.6e+04]    0.7765826   0.9120303      0.7446737
                           Neg Pred Value Precision    Recall        F1
Class: [855,1.88e+04]           0.9091272 0.7812263 0.7213000 0.7500681
Class: Mid-Low                  0.8407707 0.5095488 0.5281135 0.5186651
Class: (2.47e+04,3.23e+04]      0.8355828 0.5102255 0.5091575 0.5096909
Class: (3.23e+04,5.6e+04]       0.9251271 0.7446737 0.7765826 0.7602935
                           Prevalence Detection Rate Detection Prevalence
Class: [855,1.88e+04]       0.2506131      0.1807672            0.2313890
Class: Mid-Low              0.2500000      0.1320284            0.2591084
Class: (2.47e+04,3.23e+04]  0.2510510      0.1278245            0.2505255
Class: (3.23e+04,5.6e+04]   0.2483360      0.1928534            0.2589771
                           Balanced Accuracy
Class: [855,1.88e+04]              0.8268745
Class: Mid-Low                     0.6793367
Class: (2.47e+04,3.23e+04]         0.6726633
Class: (3.23e+04,5.6e+04]          0.8443065
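
The per-class figures show that the two middle income quartiles are much harder to classify than the extremes. The sketch below recomputes recall and precision from the four-class confusion matrix and measures how many errors fall into a neighbouring quartile. The labels Q1 to Q4 are shorthand introduced here for the four income bins, from lowest to highest.

```r
# Confusion matrix copied from the output above, filled column by column:
# rows are predictions, columns are the true test labels.
cm4 <- matrix(c(4128, 1293,  237,   65,
                 981, 3015, 1483,  230,
                 159, 1440, 2919, 1215,
                  16,  169, 1082, 4404),
              nrow = 4,
              dimnames = list(predicted = paste0("Q", 1:4),
                              actual    = paste0("Q", 1:4)))

accuracy4 <- sum(diag(cm4)) / sum(cm4)
recall    <- diag(cm4) / colSums(cm4)   # sensitivity per class
precision <- diag(cm4) / rowSums(cm4)   # positive predictive value per class

# Share of all misclassifications that land in an adjacent quartile.
adjacent <- abs(row(cm4) - col(cm4)) == 1
adjacent_share <- sum(cm4[adjacent]) / (sum(cm4) - sum(diag(cm4)))

print(round(rbind(recall, precision), 4))
print(round(c(accuracy = accuracy4, adjacent_share = adjacent_share), 4))
```

Most of the errors are off by a single quartile, which suggests the classifier orders tracts reasonably well even when it misses the exact bin.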

Ridge and Lasso Regression

[1]  11 100


Ridge lambda value at 50th percentile: 
[1] 11497.57

Ridge coefficients for lambda at 50th percentile: 
 (Intercept)     Hispanic        White        Black        Asian Professional 
  26167.8189    -689.1002     841.0879    -539.0475     479.8605    2185.3426 
     Service       Office Construction   Production Unemployment 
  -1484.7131    -283.9175    -885.1962   -1418.9224   -1231.2567 

Ridge MSE for lambda at 50th percentile : 
[1] 3616.177

Ridge lambda value at 60th percentile: 
[1] 705.4802

Ridge coefficients for lambda value at 60th percentile: 
 (Intercept)     Hispanic        White        Black        Asian Professional 
  26167.8189    -424.2205    1148.1987    -175.4403     575.0376    3268.5062 
     Service       Office Construction   Production Unemployment 
  -2208.5207    -697.1881   -1146.3519   -2030.7716   -1606.5620 

Ridge MSE for lambda at 60th percentile: 
[1] 5091.732
 (Intercept)     Hispanic        White        Black        Asian Professional 
  26167.8189     511.7676    2362.9118     735.6647     928.3809    3378.2481 
     Service       Office Construction   Production Unemployment 
  -2286.4616    -762.8110   -1152.9414   -2104.9176   -1614.7757 

Train and Test sets

[1] 824.8974
lowest lambda from CV:  824.8974 
MSE for best Ridge lambda:  30834392 

All the coefficients : 
 (Intercept)     Hispanic        White        Black        Asian Professional 
  26167.8189    -463.2954    1103.9441    -215.2605     564.0687    3247.7449 
     Service       Office Construction   Production Unemployment 
  -2194.5052    -687.5156   -1144.0593   -2020.6029   -1601.7077 

R^2: 
[1] 0.7065955

Lasso

lowest lambda from CV:  16.19307 
 MSE for best Lasso lambda:  30709528 

All the coefficients : 
 (Intercept)     Hispanic        White        Black        Asian Professional 
 26167.81892     13.56463   1690.00411    247.03605    712.23492   6030.24803 
     Service       Office Construction   Production Unemployment 
  -716.51986    367.51269      0.00000   -622.32649  -1612.95442 

The non-zero coefficients : 
 (Intercept)     Hispanic        White        Black        Asian Professional 
 26167.81892     13.56463   1690.00411    247.03605    712.23492   6030.24803 
     Service       Office   Production Unemployment 
  -716.51986    367.51269   -622.32649  -1612.95442 

R^2: 
[1] 0.7077836

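The lambda values chosen by cross-validation are small, so the penalized coefficients do not deviate much from the OLS estimates. The output reports 8 non-zero coefficients but 9 are listed, most likely because the Hispanic coefficient is so small. The effects of White and Professional are much stronger than those of the other predictors. e^5.5 =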
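
The link between a small penalty and near-OLS estimates can be illustrated with the textbook closed form for ridge regression. This is a minimal sketch on simulated data, not the glmnet fit used above: the data, the function ridge_coef, and the lambda values are hypothetical, and glmnet scales its penalty differently, so these lambda values are not comparable with the ones reported in this section.

```r
set.seed(42)
n <- 200
# Standardized predictors and a centered response, so no intercept is needed.
X <- scale(matrix(rnorm(n * 3), ncol = 3))
y <- drop(X %*% c(3, -2, 0.5) + rnorm(n))
y <- y - mean(y)

# Textbook ridge estimator: (X'X + lambda * I)^(-1) X'y.
ridge_coef <- function(lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

ols   <- unname(coef(lm(y ~ X - 1)))  # unpenalized fit for comparison
small <- ridge_coef(0.01)             # light penalty
large <- ridge_coef(1000)             # heavy penalty

print(round(rbind(ols = ols, small = small, large = large), 3))
```

With a light penalty the estimates sit almost on top of the OLS values, while a heavy penalty pulls all of them toward zero. Lasso behaves similarly for small penalties but can set coefficients exactly to zero, as it does for Construction above.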


Call:
lm(formula = IncomePerCap ~ ., data = datJLClean)

Residuals:
   Min     1Q Median     3Q    Max 
-57889  -3154   -136   3093  39355 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  26167.82      20.93 1250.463   <2e-16 ***
Hispanic       973.15      87.56   11.114   <2e-16 ***
White         2983.27     117.35   25.422   <2e-16 ***
Black         1175.86      84.20   13.966   <2e-16 ***
Asian         1115.18      41.58   26.817   <2e-16 ***
Professional   921.45    4378.81    0.210    0.833    
Service      -3752.36    2603.40   -1.441    0.149    
Office       -1839.03    1898.41   -0.969    0.333    
Construction -2235.99    1930.85   -1.158    0.247    
Production   -3487.94    2435.84   -1.432    0.152    
Unemployment -1604.28      26.65  -60.195   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5519 on 69556 degrees of freedom
Multiple R-squared:  0.7102,    Adjusted R-squared:  0.7101 
F-statistic: 1.704e+04 on 10 and 69556 DF,  p-value: < 2.2e-16

Call:
lm(formula = IncomePerCap ~ Hispanic + White + Black + Asian + 
    Professional + Service + Office + Production + Unemployment, 
    data = datJLClean)

Residuals:
   Min     1Q Median     3Q    Max 
-57889  -3155   -139   3092  39315 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26167.82      20.93 1250.46   <2e-16 ***
Hispanic       972.98      87.56   11.11   <2e-16 ***
White         2983.19     117.35   25.42   <2e-16 ***
Black         1175.89      84.20   13.97   <2e-16 ***
Asian         1115.17      41.58   26.82   <2e-16 ***
Professional  5991.86      53.93  111.11   <2e-16 ***
Service       -737.90      39.77  -18.55   <2e-16 ***
Office         359.14      28.63   12.54   <2e-16 ***
Production    -667.56      40.62  -16.44   <2e-16 ***
Unemployment -1604.14      26.65  -60.19   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5519 on 69557 degrees of freedom
Multiple R-squared:  0.7102,    Adjusted R-squared:  0.7101 
F-statistic: 1.894e+04 on 9 and 69557 DF,  p-value: < 2.2e-16

MSE for full model : 
[1] 30459848

MSE for full model (w/o construction) : 
[1] 30460435
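
In the first regression above, the standard errors of the five occupation shares are in the thousands, yet dropping Construction brings them down to double digits while the MSE barely moves. One plausible reading, offered here as an assumption rather than something the output confirms, is that the occupation shares sum to roughly 100 percent in each tract and are therefore nearly collinear. The sketch below reproduces that pattern on simulated shares; all names and numbers in it are hypothetical.

```r
set.seed(7)
n <- 500
# Hypothetical shares A, B, C that sum to 100 in every row.
raw <- matrix(rexp(n * 3), ncol = 3)
shares <- 100 * raw / rowSums(raw)
colnames(shares) <- c("A", "B", "C")

# Small noise mimics rounding, so the columns are nearly (not exactly) collinear.
shares_noisy <- shares + matrix(rnorm(n * 3, sd = 0.05), ncol = 3)
d <- data.frame(scale(shares_noisy))
d$y <- 2 * d$A - 1 * d$B + rnorm(n)

full    <- lm(y ~ A + B + C, data = d)  # all shares included
reduced <- lm(y ~ A + B, data = d)      # one share dropped

se_full    <- summary(full)$coefficients["A", "Std. Error"]
se_reduced <- summary(reduced)$coefficients["A", "Std. Error"]
mse_full    <- mean(residuals(full)^2)
mse_reduced <- mean(residuals(reduced)^2)

print(c(se_full = se_full, se_reduced = se_reduced))
print(c(mse_full = mse_full, mse_reduced = mse_reduced))
```

Dropping one share removes the near-linear dependence, so the standard errors fall sharply while the fit is essentially unchanged. This is the same pattern as the two lm summaries above and fits the Lasso result, which set the Construction coefficient to zero.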

Conclusion

Overall, this analysis found several ways in which our independent variables reliably predict income in communities across the United States. The Freedom variables we drew from the Cato Institute performed worst, with high internal correlation and little predictive power. Ethnicity and work type proportions had stronger predictive power, with the latter having the most powerful effects. However, these variables are largely non-normal, with a rightward skew, and are highly correlated with one another, both between and within the two categories. Altogether, these variables allow us to predict income per capita at the census tract level with high reliability (R-squared = .67), which is notable given how simple the data are. For instance, the data do not directly include any information about the age or education of the population.
Moving forward, this analysis allows for several extensions. The first is to integrate new data, such as the age and education status of census tract residents. Additionally, it may be valuable to consider each of the individual freedom measures on its own, to avoid the influence of their high internal correlation. Finally, it would be interesting to test whether geographic density drives differences in income; density can be estimated from the data already at hand.

Bibliography

Cato Institute. (2018). Freedom In the Fifty States. UpToDate. Retrieved March 23, 2020, from https://www.freedominthe50states.org/how-its-calculated

MuonNeutrino. (2015). US Census Demographic Data: Demographic and Economic Data for Tracts and Counties. UpToDate. Retrieved March 23, 2020, from https://www.kaggle.com/muonneutrino/us-census-demographic-data

U.S. Census Bureau (2019). “Annual Estimates of the Resident Population for the United States, Regions, States, and Puerto Rico: April 1, 2010 to July 1, 2019”. 2010-2019 Population Estimates. United States Census Bureau, Population Division. December 30, 2019. Retrieved January 27, 2020.

U.S. Census Bureau (2017). “American FactFinder - Results”. U.S. Census Bureau. Retrieved December 13, 2017.

U.S. Census Bureau (2013). “2010 Census Summary File 1: GEOGRAPHIC IDENTIFIERS”. American FactFinder. U.S. Census Bureau. Retrieved October 18, 2013.